Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment

Pedram Zaree; Md Abdullah Al Mamun; Quazi Mishkatul Alam; Yue Dong; Ihsen Alouani; Nael Abu-Ghazaleh

doi:10.18653/v1/2025.emnlp-main.842

Abstract

Recent research has shown that carefully crafted jailbreak inputs can induce large language models to produce harmful outputs, despite safety measures such as alignment. It is important to anticipate the range of potential jailbreak attacks to guide effective defenses and accurate assessment of model safety. In this paper, we present a new approach for generating highly effective jailbreak attacks that manipulate the attention of the model to selectively strengthen or weaken attention among different parts of the prompt. By harnessing attention loss, we develop more effective jailbreak attacks that are also transferable. The attacks amplify the success rate of existing jailbreak algorithms, including GCG, AutoDAN, and ReNeLLM, while lowering their generation cost (for example, the amplified GCG attack achieves a 91.2% attack success rate (ASR), vs. 67.9% for the original attack on Llama2-7B-chat/AdvBench, using less than a third of the generation time).

Anthology ID: 2025.emnlp-main.842
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 16648–16668
URL: https://aclanthology.org/2025.emnlp-main.842/
DOI: 10.18653/v1/2025.emnlp-main.842
Cite (ACL): Pedram Zaree, Md Abdullah Al Mamun, Quazi Mishkatul Alam, Yue Dong, Ihsen Alouani, and Nael Abu-Ghazaleh. 2025. Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 16648–16668, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment (Zaree et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.842.pdf
Checklist: 2025.emnlp-main.842.checklist.pdf
